Machine Learning Approaches to Understanding and Predicting Traffic Accident Severity¶
By- Ibrahim Ahmed Mohammmed (UID:121322005)
Introduction¶
Every year, nearly 1.35 million people lose their lives in traffic accidents around the world. This heartbreaking reality leaves families devastated, communities shaken, and nations mourning. Road crashes are the eighth leading cause of death globally, and for young people aged 5–29 they are the leading cause of death. What makes this even more tragic is that many of these deaths could be prevented with better understanding and foresight.
In the U.S. alone, traffic accidents not only cause immeasurable human pain but also come with a staggering economic cost—amounting to hundreds of billions of dollars annually. The most severe accidents contribute significantly to these costs, underscoring the urgent need to focus on preventing them. This is where the power of data and technology can make a life-changing difference. By predicting accidents and understanding the factors that lead to their severity, we might be able to implement well-informed actions and better allocate financial and human resources. This project aims to leverage machine learning to predict the severity of traffic accidents, using advanced models like Neural Networks and XGBoost to uncover patterns that could help prevent future tragedies.
Why This Dataset Matters for Road Safety¶
Impact on Public Safety: Analyzing accidents helps us identify the root causes and risk factors, allowing us to implement preventive measures and reduce the human toll of accidents. Prioritizing safety on our roads is a fundamental moral imperative.
Optimization of Emergency Response: Predicting the severity of accidents enables better resource allocation, ensuring that first responders can arrive at the scene faster and provide appropriate medical treatment. This can potentially save lives and reduce the severity of injuries.
Informing Infrastructure Improvements: By identifying accident-prone areas, transportation authorities can optimize traffic flow, make infrastructure improvements, and implement targeted safety measures. This leads to smoother traffic, shorter commutes, and lower transportation costs.
Why I Chose the Kaggle Dataset: US Accidents (2016 - 2023)¶
Dataset Selection: US Accidents (2016 - 2023)
For this project, we chose the "US Accidents (2016 - 2023)" dataset, which provides a comprehensive record of traffic accidents across the United States between 2016 and 2023. This dataset is invaluable for understanding the patterns, causes, and severity of accidents, as it includes key features such as accident location, weather conditions, road types, vehicle data, and accident severity. These attributes are critical for developing machine learning models that predict accident severity, a central objective of this project.
You can access the dataset via the following link: US Accidents (2016 - 2023).
Importing all important libraries
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import boxcox
import warnings
warnings.filterwarnings("ignore")
Loading the Dataset¶
To begin our analysis, we first load the dataset into our environment. The dataset, stored in a CSV file, is read using Python's pandas library. pandas is a powerful tool for data manipulation and analysis, offering efficient structures and functions to work seamlessly with structured data.
The code snippet demonstrates the process of importing pandas, specifying the dataset file path, and using pd.read_csv() to load the data into a DataFrame for analysis. The first ten rows of the dataset are displayed using data.head(10) to verify successful loading.
This step establishes the foundation for all further data exploration and preprocessing tasks in the project.
# File path (update this to the location of your filtered dataset on your local machine)
file_path = r"D:\602\Accident Dataset\filtered_dataset.csv" # Use a raw string or double backslashes for Windows paths
# Load the dataset
data = pd.read_csv(file_path)
# Display the first 10 rows
data.head(10)
| ID | Source | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | ... | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-164 | Source2 | 1 | 2016-02-15 17:22:10 | 2016-02-15 18:07:10 | 41.395805 | -81.935562 | NaN | NaN | 0.00 | ... | False | False | False | False | False | Day | Day | Day | Day | 2016.0 |
| 1 | A-375 | Source2 | 1 | 2016-02-24 07:59:51 | 2016-02-24 08:29:51 | 40.018669 | -81.565704 | NaN | NaN | 0.00 | ... | False | False | False | False | False | Day | Day | Day | Day | 2016.0 |
| 2 | A-961 | Source2 | 1 | 2016-06-22 23:54:48 | 2016-06-23 00:39:48 | 37.750488 | -121.379982 | NaN | NaN | 0.00 | ... | False | False | False | False | False | Night | Night | Night | Night | 2016.0 |
| 3 | A-1391 | Source2 | 1 | 2016-06-27 09:17:06 | 2016-06-27 09:47:06 | 36.831322 | -121.435173 | NaN | NaN | 0.00 | ... | False | False | False | False | False | Day | Day | Day | Day | 2016.0 |
| 4 | A-7852 | Source2 | 1 | 2016-12-20 10:31:49 | 2016-12-20 11:01:49 | 38.454693 | -120.867790 | NaN | NaN | 0.01 | ... | False | False | False | True | False | Day | Day | Day | Day | 2016.0 |
| 5 | A-8644 | Source2 | 1 | 2016-12-26 18:32:07 | 2016-12-26 19:17:07 | 37.752113 | -122.420593 | NaN | NaN | 0.01 | ... | True | False | False | True | False | Night | Night | Night | Night | 2016.0 |
| 6 | A-13036 | Source2 | 1 | 2016-10-21 17:51:00 | 2016-10-21 18:21:00 | 36.981586 | -121.999702 | NaN | NaN | 0.01 | ... | False | False | False | False | False | Day | Day | Day | Day | 2016.0 |
| 7 | A-13037 | Source2 | 1 | 2016-10-21 17:52:42 | 2016-10-21 18:22:42 | 37.726498 | -122.402885 | NaN | NaN | 0.00 | ... | True | False | False | False | False | Day | Day | Day | Day | 2016.0 |
| 8 | A-13053 | Source2 | 1 | 2016-10-21 19:43:19 | 2016-10-21 20:13:19 | 37.946297 | -122.537216 | NaN | NaN | 0.01 | ... | False | False | False | False | False | Night | Night | Night | Day | 2016.0 |
| 9 | A-13380 | Source2 | 1 | 2016-10-24 20:54:32 | 2016-10-24 21:54:32 | 38.440739 | -122.745216 | NaN | NaN | 0.01 | ... | False | False | False | False | False | Night | Night | Night | Night | 2016.0 |
10 rows × 47 columns
Dataset Overview¶
The "US Accidents (2016 - 2023)" dataset provides a comprehensive view of traffic accidents across the United States. Here are the key columns and their significance:
- ID: A unique identifier for each accident record.
- Severity: Indicates the impact of the accident on traffic, ranging from 1 (minor impact) to 4 (significant impact).
- Start_Time: The local time when the accident occurred.
- End_Time: The local time when the impact of the accident on traffic flow was dismissed.
- Start_Lat/Start_Lng: GPS coordinates of the accident's start point.
- End_Lat/End_Lng: GPS coordinates of the accident's end point.
- Distance(mi): The length of the road extent affected by the accident in miles.
- Description: A human-provided description of the accident.
- Location: Includes street, city, county, state, zipcode, and country information.
- Timezone: The timezone based on the location of the accident.
- Weather_Condition: Describes the weather conditions at the time of the accident (e.g., rain, snow, fog).
- Temperature(F), Wind_Chill(F), Humidity(%), Pressure(in), Visibility(mi), Wind_Direction, Wind_Speed(mph), Precipitation(in): Weather-related features that provide context for the accident.
- POI Annotations: Indicate the presence of various points of interest (POIs) near the accident location, such as amenities, crossings, junctions, and traffic signals.
- Sunrise_Sunset, Civil_Twilight, Nautical_Twilight, Astronomical_Twilight: Indicate the period of the day based on different twilight definitions.
This dataset provides a rich set of features that allow for a comprehensive analysis of traffic accidents, helping to uncover patterns and factors that contribute to their severity.
print(data.info()) # To view column data types and non-null counts
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1721566 entries, 0 to 1721565
Data columns (total 47 columns):
 #   Column                 Dtype
---  ------                 -----
 0   ID                     object
 1   Source                 object
 2   Severity               int64
 3   Start_Time             object
 4   End_Time               object
 5   Start_Lat              float64
 6   Start_Lng              float64
 7   End_Lat                float64
 8   End_Lng                float64
 9   Distance(mi)           float64
 10  Description            object
 11  Street                 object
 12  City                   object
 13  County                 object
 14  State                  object
 15  Zipcode                object
 16  Country                object
 17  Timezone               object
 18  Airport_Code           object
 19  Weather_Timestamp      object
 20  Temperature(F)         float64
 21  Wind_Chill(F)          float64
 22  Humidity(%)            float64
 23  Pressure(in)           float64
 24  Visibility(mi)         float64
 25  Wind_Direction         object
 26  Wind_Speed(mph)        float64
 27  Precipitation(in)      float64
 28  Weather_Condition      object
 29  Amenity                bool
 30  Bump                   bool
 31  Crossing               bool
 32  Give_Way               bool
 33  Junction               bool
 34  No_Exit                bool
 35  Railway                bool
 36  Roundabout             bool
 37  Station                bool
 38  Stop                   bool
 39  Traffic_Calming        bool
 40  Traffic_Signal         bool
 41  Turning_Loop           bool
 42  Sunrise_Sunset         object
 43  Civil_Twilight         object
 44  Nautical_Twilight      object
 45  Astronomical_Twilight  object
 46  Year                   float64
dtypes: bool(13), float64(13), int64(1), object(20)
memory usage: 467.9+ MB
None
Data Cleaning¶
After loading the dataset, the next crucial step is data cleaning. Data cleaning involves identifying and handling issues such as missing values, duplicates, or inconsistencies in the dataset that may affect the quality of our analysis. This step ensures that the dataset is in good shape and ready for exploration and modeling.
The dataset may contain various types of errors, including:
- Missing values: Rows or columns where data is missing or null.
- Duplicates: Repeated entries that may skew results.
- Inconsistent data: Errors such as incorrect data types, outliers, or irrelevant information.
By cleaning the data, we ensure the dataset is reliable and free from issues that could distort the analysis and predictions. For example, we might:
- Remove or impute missing values.
- Drop duplicate rows.
- Correct inconsistencies and errors in the data.
Data cleaning is an essential step before we proceed with any analysis or machine learning tasks.
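Before diving into specific fixes, a quick audit of missing values and duplicates helps gauge how much cleaning is needed. The snippet below is a small sketch of such a check, assuming the DataFrame loaded earlier is still available as data.
# Audit missing values: count and percentage per column (only columns with gaps)
missing_counts = data.isnull().sum()
print(missing_counts[missing_counts > 0].sort_values(ascending=False))
missing_pct = (data.isnull().mean() * 100).round(2)
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
# Count fully duplicated rows
print(f"Duplicate rows: {data.duplicated().sum()}")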
Filtering the "Source" Column¶
In this step, we focus on cleaning the "Source" column in the dataset. Upon inspecting the data, we observed that records from 'Source1' and 'Source3' were not reported in the same format as those from 'Source2'. Since 'Source2' contains the most comprehensive and consistent data, we decided to remove rows from 'Source1' and 'Source3' to maintain uniformity and ensure reliable analysis.
import pandas as pd
import matplotlib.pyplot as plt
# Group by 'Source' and 'Severity' and count occurrences
df_source = data.groupby(['Source', 'Severity']).size().reset_index(name='Count')
# Pivot the table to have 'Source' as the index and 'Severity' as columns
df_source_pivot = df_source.pivot(index='Source', columns='Severity', values='Count')
# Plot the stacked bar chart
df_source_pivot.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.title('Severity Count by Sources')
plt.xlabel('Source')
plt.ylabel('Count')
plt.legend(title='Severity')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
To ensure consistency and focus on the most reliable data, we remove rows whose 'Source' is 'Source1' or 'Source3'.
# Remove rows where 'Source' is 'Source1' or 'Source3'
# ('Source' holds string labels such as 'Source2', so we match on those labels)
data = data[~data['Source'].isin(['Source1', 'Source3'])]
# Verify the filtered data
print(f"Rows after filtering: {len(data)}")
print(data['Source'].value_counts())
Rows after filtering: 1721566
Source
Source2    1384718
Source1     290024
Source3      46824
Name: count, dtype: int64
Dropping Unnecessary Columns and Handling Missing Values¶
In this step, we focused on improving the dataset by removing unnecessary columns and addressing missing values. We started by identifying columns that were not relevant for our analysis, such as "Airport_Code", "Description", and "Street", among others. These columns were dropped from the dataset to reduce its complexity and focus on the essential features.
# List of columns to drop
# Note: 'Distance(mi)' and 'End_Time' are kept at this stage; 'Distance(mi)' is used in later
# filtering and modeling steps, and 'End_Time' is dropped separately before modeling.
columns_to_drop = [
    'Airport_Code', 'Description',
    'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight',
    'ID', 'Street', 'Timezone', 'Country', 'End_Lat', 'End_Lng', 'Source',
    'Turning_Loop', 'City', 'County',
]
# Check which columns exist in the dataset
existing_columns_to_drop = [col for col in columns_to_drop if col in data.columns]
# Drop only the existing columns
data = data.drop(columns=existing_columns_to_drop)
# Check if 'Precipitation(in)' column exists
if 'Precipitation(in)' not in data.columns:
raise ValueError("The 'Precipitation(in)' column must exist in the dataset.")
# Create a new feature for missing values in 'Precipitation(in)'
data['Precipitation_NA'] = 0
data.loc[data['Precipitation(in)'].isnull(), 'Precipitation_NA'] = 1
# Replace missing values in 'Precipitation(in)' with the median
median_precipitation = data['Precipitation(in)'].median()
data['Precipitation(in)'] = data['Precipitation(in)'].fillna(median_precipitation)
print("Updated dataset with new feature for missing values in 'Precipitation(in)':")
print(data[['Precipitation(in)']].head())
print("Columns dropped. Updated dataset shape:", data.shape)
Updated dataset with new feature for missing values in 'Precipitation(in)':
   Precipitation(in)
0               0.00
1               0.08
2               0.00
3               0.00
4               0.00
Columns dropped. Updated dataset shape: (1721566, 33)
After dropping the unnecessary columns, we addressed the missing values in the "Precipitation(in)" column. If any values were missing in this column, we created a new binary feature, Precipitation_NA, to indicate whether the value was missing (1) or not (0). We then filled the missing values in the "Precipitation(in)" column with the median value to maintain the consistency of the data.
Removing Rows with Missing Key Values¶
Handling missing values is a crucial step in data cleaning to ensure the quality and reliability of the dataset. In this step, we focus on specific columns that are essential for our analysis and remove rows where these columns have missing values.
# Drop rows where any of the specified columns have missing values
columns_to_check = [
'Zipcode','Sunrise_Sunset',"Distance(mi)"
]
existing_columns_to_check = [col for col in columns_to_check if col in data.columns]
# Drop rows with NaN in the existing columns
data = data.dropna(subset=existing_columns_to_check)
print(f"Rows with missing values in {existing_columns_to_check} removed. Updated shape: {data.shape}")
Rows with missing values in ['Zipcode', 'Sunrise_Sunset', 'Distance(mi)'] removed. Updated shape: (1717605, 33)
Handling Missing Weather Data¶
To address missing values in the weather-related columns ("Temperature(F)", "Humidity(%)", "Pressure(in)", "Visibility(mi)", and "Wind_Speed(mph)"), we used a method that fills the missing data based on the median values of specific groups. The groups were defined by "State" and "Start_Month" to account for regional and seasonal variations in weather patterns.
First, the "Start_Month" column was derived from the "Start_Time" column, which was converted to a datetime format to extract the month. The missing values in the weather columns were then filled using the median value within each "State" and "Start_Month" group. This approach ensures that the imputed values are contextually appropriate, based on the location and time of year.
# Ensure the necessary columns exist in the dataset
weather_columns = ['Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)']
grouping_columns = ['State', 'Start_Month']
data['Start_Month'] = pd.to_datetime(data['Start_Time']).dt.month
existing_weather_columns = [col for col in weather_columns if col in data.columns]
existing_grouping_columns = [col for col in grouping_columns if col in data.columns]
if len(existing_grouping_columns) < 2:
raise ValueError("Both 'State' and 'Start_Month' columns must exist in the dataset.")
# Replace missing values in weather features by grouping by 'State' and 'Start_Month'
for col in existing_weather_columns:
data[col] = data.groupby(existing_grouping_columns)[col].transform(lambda x: x.fillna(x.median()))
# Check for remaining missing values in the weather columns
missing_values = data[existing_weather_columns].isna().sum()
print(f"Remaining missing values in weather features:\n{missing_values}")
Remaining missing values in weather features:
Temperature(F)     0
Humidity(%)        0
Pressure(in)       0
Visibility(mi)     0
Wind_Speed(mph)    0
dtype: int64
Extracting Datetime Components¶
Extracting datetime components from the 'Start_Time' column is essential for detailed temporal analysis. This process helps us understand the distribution of accidents over different time periods, such as years, months, and days of the week. We first convert 'Start_Time' to a proper datetime type and then derive the individual components.
# Convert 'Start_Time' to datetime if it is not already
data['Start_Time'] = pd.to_datetime(data['Start_Time'])
# Extract datetime components from 'Start_Time'
data['Year'] = data['Start_Time'].dt.year
data['Month'] = data['Start_Time'].dt.month
data['Weekday'] = data['Start_Time'].dt.weekday
# Calculate the day of the year
days_each_month = np.cumsum(np.array([0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]))
nmonth = data['Start_Time'].dt.month
nday = [days_each_month[arg - 1] for arg in nmonth.values]
nday = nday + data["Start_Time"].dt.day.values
data['Day'] = nday
data['Hour'] = data['Start_Time'].dt.hour
data['Minute'] = data['Hour'] * 60.0 + data["Start_Time"].dt.minute
print(data.loc[:4, ['Start_Time', 'Year', 'Month', 'Weekday', 'Day', 'Hour', 'Minute']])
           Start_Time  Year  Month  Weekday  Day  Hour  Minute
0 2016-02-15 17:22:10  2016      2        0   46    17  1042.0
1 2016-02-24 07:59:51  2016      2        2   55     7   479.0
2 2016-06-22 23:54:48  2016      6        2  173    23  1434.0
3 2016-06-27 09:17:06  2016      6        0  178     9   557.0
4 2016-12-20 10:31:49  2016     12        1  354    10   631.0
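As a side note, the manual cumulative-month calculation above does not account for leap years. pandas can derive the same feature directly; the small sketch below is shown for comparison only and is not the code used to produce the output above.
# Alternative: pandas derives day-of-year directly and handles leap years.
# Computed separately here so the 'Day' column above is left untouched.
day_of_year = data['Start_Time'].dt.dayofyear
print((day_of_year != data['Day']).sum(), "rows differ (dates after February in leap years)")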
Simplifying Wind Direction¶
Beyond the weather imputation above, we also simplify the wind direction categories. The raw 'Wind_Direction' column mixes full names, abbreviations, and fine-grained compass points (for example 'West', 'WSW', and 'WNW'); collapsing them into the eight principal directions plus 'CALM' and 'VAR' makes the feature easier to encode and analyze.
# Simplify 'Wind_Direction'
if 'Wind_Direction' in data.columns:
data.loc[data['Wind_Direction'] == 'Calm', 'Wind_Direction'] = 'CALM'
data.loc[(data['Wind_Direction'] == 'West') | (data['Wind_Direction'] == 'WSW') | (data['Wind_Direction'] == 'WNW'), 'Wind_Direction'] = 'W'
data.loc[(data['Wind_Direction'] == 'South') | (data['Wind_Direction'] == 'SSW') | (data['Wind_Direction'] == 'SSE'), 'Wind_Direction'] = 'S'
data.loc[(data['Wind_Direction'] == 'North') | (data['Wind_Direction'] == 'NNW') | (data['Wind_Direction'] == 'NNE'), 'Wind_Direction'] = 'N'
data.loc[(data['Wind_Direction'] == 'East') | (data['Wind_Direction'] == 'ESE') | (data['Wind_Direction'] == 'ENE'), 'Wind_Direction'] = 'E'
data.loc[data['Wind_Direction'] == 'Variable', 'Wind_Direction'] = 'VAR'
# Print unique wind directions after simplification
print("Wind Direction after simplification: ", data['Wind_Direction'].unique())
else:
print("The 'Wind_Direction' column does not exist in the dataset.")
Wind Direction after simplification:  ['S' 'E' 'W' 'CALM' 'SE' 'NW' 'N' 'NE' nan 'VAR' 'SW']
Creating Weather Condition Features¶
To enhance our analysis, we need to transform the 'Weather_Condition' column into more granular and meaningful features. This step involves identifying distinctive weather conditions and creating binary features for common weather types.
# Show distinctive weather conditions
weather_conditions = '!'.join(data['Weather_Condition'].dropna().unique().tolist())
weather_conditions = np.unique(np.array(re.split(
    r"!|\s/\s|\sand\s|\swith\s|Partly\s|Mostly\s|Blowing\s|Freezing\s", weather_conditions))).tolist()
print("Weather Conditions: ", weather_conditions)
# Create features for some common weather conditions
data['Clear'] = np.where(data['Weather_Condition'].str.contains('Clear', case=False, na=False), True, False)
data['Cloud'] = np.where(data['Weather_Condition'].str.contains('Cloud|Overcast', case=False, na=False), True, False)
data['Rain'] = np.where(data['Weather_Condition'].str.contains('Rain|storm', case=False, na=False), True, False)
data['Heavy_Rain'] = np.where(data['Weather_Condition'].str.contains('Heavy Rain|Rain Shower|Heavy T-Storm|Heavy Thunderstorms', case=False, na=False), True, False)
data['Snow'] = np.where(data['Weather_Condition'].str.contains('Snow|Sleet|Ice', case=False, na=False), True, False)
data['Heavy_Snow'] = np.where(data['Weather_Condition'].str.contains('Heavy Snow|Heavy Sleet|Heavy Ice Pellets|Snow Showers|Squalls', case=False, na=False), True, False)
data['Fog'] = np.where(data['Weather_Condition'].str.contains('Fog', case=False, na=False), True, False)
# Assign NA to created weather features where 'Weather_Condition' is null
weather_features = ['Clear', 'Cloud', 'Rain', 'Heavy_Rain', 'Snow', 'Heavy_Snow', 'Fog']
for feature in weather_features:
data.loc[data['Weather_Condition'].isnull(), feature] = np.nan
data[feature] = data[feature].astype('bool')
print(data[['Weather_Condition'] + weather_features].head())
# Drop the original 'Weather_Condition' column
data = data.drop(['Weather_Condition'], axis=1)
Weather Conditions:  ['', 'Clear', 'Cloudy', 'Drifting Snow', 'Drizzle', 'Dust', 'Dust Whirlwinds', 'Duststorm', 'Fair', 'Fog', 'Funnel Cloud', 'Hail', 'Haze', 'Heavy ', 'Heavy Drizzle', 'Heavy Ice Pellets', 'Heavy Rain', 'Heavy Rain Showers', 'Heavy Sleet', 'Heavy Smoke', 'Heavy Snow', 'Heavy T-Storm', 'Heavy Thunderstorms', 'Ice Pellets', 'Light ', 'Light Drizzle', 'Light Fog', 'Light Hail', 'Light Haze', 'Light Ice Pellets', 'Light Rain', 'Light Rain Shower', 'Light Rain Showers', 'Light Sleet', 'Light Snow', 'Light Snow Grains', 'Light Snow Shower', 'Light Snow Showers', 'Light Thunderstorms', 'Low Drifting Snow', 'Mist', 'N/A Precipitation', 'Overcast', 'Partial Fog', 'Patches of Fog', 'Rain', 'Rain Shower', 'Rain Showers', 'Sand', 'Scattered Clouds', 'Shallow Fog', 'Showers in the Vicinity', 'Sleet', 'Small Hail', 'Smoke', 'Snow', 'Snow Grains', 'Snow Showers', 'Squalls', 'T-Storm', 'Thunder', 'Thunder in the Vicinity', 'Thunderstorm', 'Thunderstorms', 'Tornado', 'Volcanic Ash', 'Widespread Dust', 'Windy', 'Wintry Mix']
  Weather_Condition  Clear  Cloud   Rain  Heavy_Rain   Snow  Heavy_Snow    Fog
0          Overcast  False   True  False       False  False       False  False
1        Light Rain  False  False   True       False  False       False  False
2             Clear   True  False  False       False  False       False  False
3             Clear   True  False  False       False  False       False  False
4             Clear   True  False  False       False  False       False  False
Removing Remaining Rows with Missing Values¶
After creating new columns and transforming the dataset, we proceed to clean up any remaining empty rows. To ensure that our analysis only includes complete records, we use the dropna() method to remove any rows that still contain missing values across any column. This is an important step to make sure that our dataset is free from incomplete data that could affect the accuracy of the analysis.
# Drop rows with any missing values (including NaN) in any column
data = data.dropna()
# Verify the changes
print(f"Rows after dropping NAs in all columns: {len(data)}")
Rows after dropping NAs in all columns: 1139279
data.head()
| Severity | Start_Time | End_Time | Start_Lat | Start_Lng | Distance(mi) | State | Zipcode | Weather_Timestamp | Temperature(F) | ... | Day | Hour | Minute | Clear | Cloud | Rain | Heavy_Rain | Snow | Heavy_Snow | Fog | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2016-02-15 17:22:10 | 2016-02-15 18:07:10 | 41.395805 | -81.935562 | 0.0 | OH | 44070-5152 | 2016-02-15 17:51:00 | 33.1 | ... | 46 | 17 | 1042.0 | False | True | False | False | False | False | False |
| 1 | 1 | 2016-02-24 07:59:51 | 2016-02-24 08:29:51 | 40.018669 | -81.565704 | 0.0 | OH | 43725 | 2016-02-24 07:53:00 | 46.0 | ... | 55 | 7 | 479.0 | False | False | True | False | False | False | False |
| 86 | 1 | 2016-08-29 20:21:20 | 2016-08-29 21:21:20 | 34.438126 | -118.394073 | 0.0 | CA | 91387 | 2016-08-29 20:51:00 | 82.0 | ... | 241 | 20 | 1221.0 | False | False | False | False | False | False | False |
| 91 | 1 | 2016-04-21 10:23:04 | 2016-04-21 10:53:04 | 34.274918 | -118.690063 | 0.0 | CA | 93063 | 2016-04-21 10:51:00 | 78.0 | ... | 111 | 10 | 623.0 | False | True | False | False | False | False | False |
| 96 | 1 | 2016-05-28 18:20:28 | 2016-05-28 19:05:28 | 34.422363 | -118.579720 | 0.0 | CA | 91355-4987 | 2016-05-28 17:51:00 | 71.0 | ... | 148 | 18 | 1100.0 | False | False | False | False | False | False | False |
5 rows × 45 columns
Exploratory Data Analysis (EDA) and Visualization¶
Exploratory Data Analysis (EDA) is a crucial step in understanding the underlying patterns, distributions, and relationships within the dataset. Visualization plays a key role in EDA by providing graphical representations that make it easier to interpret complex data. Through EDA and visualization, we can uncover insights, identify trends, and detect anomalies that might not be apparent from raw data alone.
In this section, we will explore the distribution of accident severity, which is a critical factor in understanding the impact of accidents on traffic. By visualizing this distribution, we can gain insights into how frequently different levels of accident severity occur and set the stage for more detailed investigations.
Moving forward, our primary focus will be on comparing Severity 4 (the most severe accidents) against all other severity levels. This focus is driven by the fact that understanding and mitigating severe accidents are crucial for improving road safety and addressing the most impactful incidents. By analyzing Severity 4 in relation to other accident severities, we can identify the key factors contributing to these critical events and prioritize safety interventions effectively.
By examining the distribution of 'Severity', we gain insights into how frequently different levels of accident severity occur. The countplot visualizes this distribution, showing that severity level 2 is the most common, followed by levels 3, 4, and 1.
# Check the distribution of 'Severity'
print("Distribution of 'Severity' in the dataset:")
print(data['Severity'].value_counts())
# Plotting the distribution of Severity
plt.figure(figsize=(8, 6))
sns.countplot(data=data, x='Severity', palette='viridis')
plt.title('Distribution of Severity in Dataset', fontsize=14)
plt.xlabel('Severity', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.tight_layout()
plt.show()
Distribution of 'Severity' in the dataset:
Severity
2    546547
3    410654
4    117106
1     64972
Name: count, dtype: int64
This initial analysis helps us understand the overall impact of accidents on traffic and sets the stage for more detailed investigations into the factors contributing to different severity levels.
# Count the occurrences of each severity level in the original data
severity_counts = data['Severity'].value_counts()
# Print the severity counts
print(severity_counts)
Severity
2    546547
3    410654
4    117106
1     64972
Name: count, dtype: int64
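Expressing the same counts as percentages makes the imbalance easier to see at a glance; the short sketch below reuses the cleaned data frame.
# Share of each severity level as a percentage of all accidents
severity_share = (data['Severity'].value_counts(normalize=True) * 100).round(2)
print(severity_share)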
Analyzing Severity by Period Features¶
Using custom colors for 'Severity 4' and 'Other Severity', we create subplots for each period feature. The countplots visualize the distribution of accident severity across different months and weekdays, helping us identify any temporal patterns.
# Map severity 1, 2, 3 to 'Other Severity' and 4 to 'Severity 4'
data['SeverityGroup'] = data['Severity'].apply(lambda x: 'Severity 4' if x == 4 else 'Other Severity')
# Period features to analyze
period_features = ['Month', 'Weekday']
# Define custom colors for 'Severity 4' and 'Other Severity'
custom_palette = {'Severity 4': 'orange', 'Other Severity': 'skyblue'}
# Create subplots for each period feature
fig, axs = plt.subplots(ncols=len(period_features), nrows=1, figsize=(13, 5)) # Adjust layout based on number of features
plt.subplots_adjust(wspace=0.5)
# Loop through period features and plot
for i, feature in enumerate(period_features, 1):
plt.subplot(1, len(period_features), i) # Adjust subplots to match the number of features
sns.countplot(x=feature, hue='SeverityGroup', data=data, palette=custom_palette)
plt.xlabel('{}'.format(feature), size=12, labelpad=3)
plt.ylabel('Accident Count', size=12, labelpad=3)
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)
plt.legend(title='Severity', loc='upper right', prop={'size': 10})
plt.title('Count of Severity in\n{} Feature'.format(feature), size=13, y=1.05)
# Add title for the entire figure
fig.suptitle('Count of Accidents by Month and Weekday (Severity Analysis)', y=1.08, fontsize=16)
plt.show()
Key Points from the Plot: Count of Accidents by Month and Weekday¶
Count of Severity in Month Feature
Seasonal Increase: The number of accidents, both Severity 4 and Other Severity, shows a noticeable increase towards the end of the year, particularly from October to December. This trend suggests that seasonal factors, such as weather conditions or holiday travel, may contribute to higher accident rates during these months.
Consistent Severity Ratio: Although the overall accident count increases towards the year's end, the proportion of Severity 4 accidents relative to Other Severity remains relatively consistent. This indicates that while the total number of accidents rises, the severity distribution stays stable.
Count of Severity in Weekday Feature
Weekday Consistency: The count of accidents remains relatively consistent throughout the weekdays (Monday to Friday). This consistency suggests that daily commuting and regular weekday activities contribute steadily to accident occurrences.
Weekend Decline: There is a slight decrease in the number of accidents on weekends (Saturday and Sunday) for both severity levels. This decline could be attributed to reduced traffic volume and different driving patterns during weekends compared to weekdays.
Analyzing Severity by Sunrise/Sunset and Hour¶
To further understand the distribution of accident severity, we analyze it across different period features such as 'Sunrise_Sunset' and 'Hour'. This helps us identify any temporal patterns in accident severity related to the time of day and lighting conditions.
# Map severity 1, 2, 3 to 'Other Severity' and 4 to 'Severity 4'
data['SeverityGroup'] = data['Severity'].apply(lambda x: 'Severity 4' if x == 4 else 'Other Severity')
# Period features to analyze
period_features = ['Sunrise_Sunset', 'Hour']
# Define custom colors for 'Severity 4' and 'Other Severity'
custom_palette = {'Severity 4': 'orange', 'Other Severity': 'skyblue'}
# Create subplots for each period feature
fig, axs = plt.subplots(ncols=len(period_features), nrows=1, figsize=(13, 5)) # Adjust layout based on number of features
plt.subplots_adjust(wspace=0.5)
# Loop through period features and plot
for i, feature in enumerate(period_features, 1):
plt.subplot(1, len(period_features), i) # Adjust subplots to match the number of features
sns.countplot(x=feature, hue='SeverityGroup', data=data, palette=custom_palette)
plt.xlabel('{}'.format(feature), size=12, labelpad=3)
plt.ylabel('Accident Count', size=12, labelpad=3)
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)
plt.legend(title='Severity', loc='upper right', prop={'size': 10})
plt.title('Count of Severity in\n{} Feature'.format(feature), size=13, y=1.05)
# Add title for the entire figure
fig.suptitle('Count of Accidents by Sunrise/Sunset and Hour (Severity Analysis)', y=1.08, fontsize=16)
plt.show()
Key Points from the Plot: Count of Accidents by Sunrise/Sunset and Hour (Severity Analysis)¶
Count of Severity in Sunrise_Sunset Feature¶
Higher Accident Counts During the Day: The plot shows that the majority of accidents occur during the daytime, with a significantly higher count compared to nighttime. This trend is observed for both Severity 4 and Other Severity accidents.
- Other Severity: Daytime accidents are much more frequent, indicating that daytime conditions, such as higher traffic volume and visibility, contribute to a higher number of less severe accidents.
- Severity 4: Although less frequent, severe accidents also occur more often during the daytime, suggesting that daytime factors may still play a role in severe accident occurrences.
Nighttime Accidents: While the overall accident count is lower at night, the proportion of Severity 4 accidents relative to Other Severity appears to be slightly higher compared to daytime. This could indicate that nighttime conditions, such as reduced visibility and potentially higher speeds, contribute to more severe accidents.
Count of Severity in Hour Feature¶
Peak Accident Hours: The plot reveals that accident counts peak during specific hours of the day. Notably, there are significant spikes in accident counts during the morning (around 8 AM) and evening (around 5 PM) rush hours.
- Morning Rush Hour: The high accident count around 8 AM suggests that morning commute traffic contributes to a higher number of accidents, both for Severity 4 and Other Severity.
- Evening Rush Hour: Similarly, the peak around 5 PM indicates that evening commute traffic also leads to a higher number of accidents.
Severity Distribution Throughout the Day:
- Consistent Severity Patterns: While the overall accident count fluctuates, the proportion of Severity 4 accidents relative to Other Severity remains relatively consistent across different hours.
- Late-Night Accidents: There is a noticeable, albeit smaller, number of accidents during late-night hours (between 12 AM and 4 AM), with a slightly higher proportion of Severity 4 accidents. This could be due to factors such as fatigue, reduced visibility, or different driving behaviors during these hours.
Geo Analysis of Accident Locations¶
To gain a comprehensive understanding of where accidents occur and their severity levels, we will conduct a geo analysis. This involves mapping all accidents categorized by severity levels to identify geographic hotspots and understand the spatial distribution of accident severity.
We start by plotting all accidents from the original dataset on a map, categorized by their severity levels. This visualization helps us identify areas with high accident frequencies and understand the distribution of accident severity across different locations.
# Plotting all accidents categorized by severity levels
plt.figure(figsize=(15, 10))
# Plot all accidents from the original data by severity levels
for severity in range(1, 5):
severity_data = data[data['Severity'] == severity]
plt.plot('Start_Lng', 'Start_Lat', data=severity_data, linestyle='', marker='o', markersize=2.5,
label=f'Severity {severity}', alpha=0.3)
# Add a legend, adjust the marker scale for better visibility
plt.legend(markerscale=6, loc='upper right')
plt.xlabel('Longitude', size=12, labelpad=3)
plt.ylabel('Latitude', size=12, labelpad=3)
plt.title('Map of Accidents by Severity (Original Data)', size=16, y=1.05)
plt.show()
# Count total accidents in the original data
total_accidents = len(data)
# Count accidents by severity levels
severity_counts_data = data.groupby('Severity').size()
print(f"Total Accidents in Original Data: {total_accidents}")
for severity in range(1, 5):
level_accidents = severity_counts_data.get(severity, 0)
print(f"Level {severity} Accidents in Original Data: {level_accidents}")
Total Accidents in Original Data: 1139279
Level 1 Accidents in Original Data: 64972
Level 2 Accidents in Original Data: 546547
Level 3 Accidents in Original Data: 410654
Level 4 Accidents in Original Data: 117106
Geographic Distribution:
- High-Density Areas: The map shows a high concentration of accidents in urban areas and along major highways. This is evident in regions such as the East Coast, particularly around major cities like New York, and the West Coast, especially in California.
- Rural vs. Urban: Urban areas exhibit a higher density of accidents compared to rural areas. This can be attributed to higher traffic volumes and more complex traffic patterns in urban settings.
Hotspots Identification:
- Major Cities: Cities like New York, Los Angeles, and Chicago show significant clusters of accidents, indicating these areas may benefit from enhanced traffic management and safety measures.
- Highways and Interstates: Major highways and interstates, such as I-95 along the East Coast and I-5 on the West Coast, are hotspots for accidents. These areas may require specific interventions like improved signage, better road maintenance, and stricter enforcement of traffic laws.
Regional Trends:
- East Coast: The East Coast, particularly the Northeast, shows a high density of accidents, likely due to the high population density and complex road networks.
- West Coast: The West Coast, especially California, also exhibits a high concentration of accidents, which can be attributed to the state's large population and extensive highway system.
# Identify the top 5 states with the highest number of accidents
top_5_states = data['State'].value_counts().head(5).index
# Filter the dataset for the top 5 states
top_5_data = data[data['State'].isin(top_5_states)]
print(top_5_states)
Index(['CA', 'FL', 'TX', 'SC', 'NY'], dtype='object', name='State')
In this part of the analysis, we generated heatmaps to visualize the concentration of accidents across the top 5 states with the highest accident rates. By focusing on the latitude and longitude of each accident, we were able to create a geographic representation that highlights accident-prone areas.
Areas with higher accident frequencies are shown as areas with more intense color, providing a clear visualization of accident-dense regions. Conversely, areas with fewer accidents are displayed with less intense coloring.
import folium
from folium.plugins import HeatMap
# Function to create a heatmap for a given state
def create_heatmap(state_data, state_name):
# Create a base map centered around the state
m = folium.Map(location=[state_data['Start_Lat'].mean(), state_data['Start_Lng'].mean()], zoom_start=7)
# Add heatmap layer
heat_data = state_data[['Start_Lat', 'Start_Lng']].values.tolist()
HeatMap(heat_data).add_to(m)
    # Save the map to an HTML file named after the state
    m.save(f"{state_name}_heatmap.html")
return m
# Create and display heatmaps for each of the top 5 states
for state in top_5_states:
state_data = top_5_data[top_5_data['State'] == state]
heatmap = create_heatmap(state_data, state)
display(heatmap)
Geo Analysis of Weather Features¶
To conduct a thorough exploratory data analysis (EDA) on weather-related features, we aim to examine how various weather conditions influence the severity of accidents. We will focus on continuous weather variables such as Temperature (°F), Humidity (%), Pressure (in), Visibility (mi), and Wind Speed (mph), along with their relationship to the Severity of accidents
While we focus on weather-related features for visualization, it's important to also handle any missing or incorrect values. We imputed missing values for the continuous variables using the median of each feature, which is a robust approach to filling in gaps without introducing biases. Alternatively, categorical features might have been imputed with the mode (most frequent category) if necessary.
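For reference, a mode-based fill for a categorical column could look like the sketch below. It is illustrative only, operates on a copy so the cleaned pipeline above is unaffected, and uses 'Wind_Direction' merely as an example column.
# Illustrative sketch: mode imputation for a categorical column (on a copy)
example = data[['Wind_Direction']].copy()
mode_value = example['Wind_Direction'].mode().iloc[0]
example['Wind_Direction'] = example['Wind_Direction'].fillna(mode_value)
print("Most frequent wind direction:", mode_value)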
Weather-Based Clustering of Accidents¶
To gain deeper insights into how different weather conditions impact accident occurrences, we performed a clustering analysis using the KMeans algorithm. This approach helps us identify distinct groups of accidents based on weather features and understand the characteristics of each cluster.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Select weather features for clustering
weather_features = top_5_data[['Temperature(F)', 'Humidity(%)', 'Visibility(mi)', 'Wind_Speed(mph)']]
# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=0).fit(weather_features)
top_5_data['Weather_Cluster'] = kmeans.labels_
# Visualize the clusters
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Start_Lng', y='Start_Lat', hue='Weather_Cluster', data=top_5_data, palette='viridis')
plt.title('Weather-Based Clustering of Accidents')
plt.show()
# Get the cluster centers
cluster_centers = pd.DataFrame(kmeans.cluster_centers_, columns=weather_features.columns)
cluster_centers['Cluster'] = cluster_centers.index
# Display the cluster centers
print("Cluster Centers:\n", cluster_centers)
Cluster Centers:
Temperature(F) Humidity(%) Visibility(mi) Wind_Speed(mph) Cluster
0 71.801736 80.376766 8.953949 6.533223 0
1 71.154934 40.493593 9.867444 8.008892 1
2 45.279198 80.000719 8.229593 6.306972 2
Cluster 0: Warm temperatures (72°F), high humidity (80%), and moderate visibility (8.95 miles) with low wind speeds suggest accidents occur in humid, moderate conditions, possibly during foggy or rainy weather.
Cluster 1: Mild temperatures (71°F), lower humidity (40%), and higher visibility (9.87 miles) with slightly higher wind speeds indicate accidents are more likely in dry, clearer conditions.
Cluster 2: Cold temperatures (45°F), high humidity (80%), and lower visibility (8.23 miles) with low wind speeds suggest accidents occur in colder, damp conditions, possibly linked to icy or slippery roads.
Overall, the colder, damp conditions of Cluster 2 and the humid, moderate-visibility conditions of Cluster 0 describe weather profiles that plausibly raise accident risk, while the dry, clear conditions of Cluster 1 appear less weather-driven; checking how many accidents fall into each cluster (below) puts these profiles in context.
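Before leaning on these interpretations, it helps to know how many accidents fall into each weather profile; the quick check below reuses the fitted cluster labels.
# Number and share of accidents assigned to each weather cluster
print(top_5_data['Weather_Cluster'].value_counts().sort_index())
print((top_5_data['Weather_Cluster'].value_counts(normalize=True) * 100).round(1).sort_index())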
Visualizing Weather Features¶
To handle skewed continuous features, we applied the Box-Cox transformation. This transformation helps in normalizing the data, making it more suitable for analysis. We shifted the values to avoid zeros and negatives, which are not permissible in the Box-Cox transformation.
# Apply Box-Cox transformation to skewed continuous features (with shifted values to avoid zeros/negatives)
data['Pressure_bc'] = boxcox(data['Pressure(in)'].apply(lambda x: x + 1), lmbda=6)
data['Visibility_bc'] = boxcox(data['Visibility(mi)'].apply(lambda x: x + 1), lmbda=0.1)
data['Wind_Speed_bc'] = boxcox(data['Wind_Speed(mph)'].apply(lambda x: x + 1), lmbda=-0.2)
# Define numerical features for visualization
num_features = ['Temperature(F)', 'Humidity(%)','Visibility_bc', 'Wind_Speed_bc']
# Loop over numerical features to create separate box plots for each
for feature in num_features:
plt.figure(figsize=(8, 6)) # Create a new figure for each feature
sns.boxplot(x="Severity", y=feature, data=data, palette="Set2") # Boxplot for each feature
# Set plot labels and title
plt.xlabel('Severity', size=12, labelpad=3)
plt.ylabel('{}'.format(feature), size=12, labelpad=3)
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)
plt.title('{} Feature by Severity'.format(feature), size=14, y=1.05)
plt.show() # Display the plot for the current feature
Preparation for Machine Learning Model¶
To prepare our dataset for a machine learning model, we performed several data cleaning and preprocessing steps. These steps include dropping unnecessary features, simplifying the severity levels, and handling class imbalances through oversampling.
Identifying the categorical columns so that they can be dropped before modeling
categorical_columns = data.select_dtypes(include=['category', 'object']).columns
print("Categorical columns:")
print(categorical_columns)
Categorical columns:
Index(['Severity', 'End_Time', 'State', 'Zipcode', 'Weather_Timestamp',
'Wind_Direction', 'Sunrise_Sunset', 'SeverityGroup'],
dtype='object')
target_count = 100000 # For example, oversample to 100000 instances for Severity 1 and 4
# Separate the classes
data_severity_1 = data[data['Severity'] == 1]
data_severity_4 = data[data['Severity'] == 4]
data_other = data[(data['Severity'] != 1) & (data['Severity'] != 4)]
# Resample Severity 1 (oversampled with replacement) and Severity 4 (reduced) to the target count
data_severity_1_oversampled = data_severity_1.sample(target_count, replace=True)
data_severity_4_oversampled = data_severity_4.sample(target_count, replace=True)
# Combine oversampled data with other classes
data2 = pd.concat([data_other, data_severity_1_oversampled, data_severity_4_oversampled])
# Verify the new class distribution
print(data2['Severity'].value_counts())
# Create Severity4 column where 1 represents severity level 4 and 0 represents other levels
data2['Severity4'] = 0
data2.loc[data2['Severity'] == 4, 'Severity4'] = 1
# Verify that Severity4 is created
print(data2['Severity4'].value_counts())
# Drop the specified columns from data2
data2 = data2.drop(columns=['Severity', 'End_Time', 'State', 'Zipcode', 'Weather_Timestamp',"Year",
'Wind_Direction', 'Sunrise_Sunset', 'SeverityGroup',"Start_Time","Day","Start_Month"])
# Verify that the columns are dropped
print(data2.columns)
numeric_data = data.select_dtypes(exclude=['datetime64'])
Severity
2 546547
3 410654
1 100000
4 100000
Name: count, dtype: int64
Severity4
0 1057201
1 100000
Name: count, dtype: int64
Index(['Start_Lat', 'Start_Lng', 'Distance(mi)', 'Temperature(F)',
'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)',
'Wind_Speed(mph)', 'Precipitation(in)', 'Amenity', 'Bump', 'Crossing',
'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Precipitation_NA',
'Month', 'Weekday', 'Hour', 'Minute', 'Clear', 'Cloud', 'Rain',
'Heavy_Rain', 'Snow', 'Heavy_Snow', 'Fog', 'Pressure_bc',
'Visibility_bc', 'Wind_Speed_bc', 'Severity4'],
dtype='object')
Dropping Unnecessary Features
We started by identifying and dropping features that are not required for the model. Such features can introduce noise or irrelevant information, which can negatively impact the model's performance.
Simplifying Severity Levels
We simplified the severity levels by mapping levels 1, 2, and 3 to a single negative class (Severity4 = 0) and treating level 4 as the positive class (Severity4 = 1). This binary framing focuses the analysis on the most severe accidents.
Handling Class Imbalance through Resampling
To reduce class imbalance, we resampled Severity 1 and Severity 4 to 100,000 instances each. This step helps the model see enough examples of the rarer classes and reduces bias towards the majority classes.
data2.head()
| Start_Lat | Start_Lng | Distance(mi) | Temperature(F) | Wind_Chill(F) | Humidity(%) | Pressure(in) | Visibility(mi) | Wind_Speed(mph) | Precipitation(in) | ... | Cloud | Rain | Heavy_Rain | Snow | Heavy_Snow | Fog | Pressure_bc | Visibility_bc | Wind_Speed_bc | Severity4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 226 | 39.063148 | -84.032608 | 0.01 | 36.0 | 33.3 | 100.0 | 29.67 | 10.0 | 3.5 | 0.00 | ... | True | False | False | False | False | False | 1.387175e+08 | 2.709816 | 1.298928 | 0 |
| 227 | 39.627781 | -84.188354 | 0.01 | 36.0 | 33.3 | 89.0 | 29.65 | 6.0 | 3.5 | 0.00 | ... | True | False | False | False | False | False | 1.381757e+08 | 2.148140 | 1.298928 | 0 |
| 228 | 39.758274 | -84.230507 | 0.00 | 34.0 | 31.0 | 100.0 | 29.66 | 7.0 | 3.5 | 0.00 | ... | True | False | False | False | False | False | 1.384464e+08 | 2.311444 | 1.298928 | 0 |
| 231 | 39.790760 | -84.241547 | 0.01 | 36.0 | 31.1 | 89.0 | 29.65 | 10.0 | 5.8 | 0.00 | ... | True | False | False | False | False | False | 1.381757e+08 | 2.709816 | 1.592246 | 0 |
| 232 | 39.972038 | -82.913521 | 0.01 | 37.4 | 33.8 | 100.0 | 29.62 | 3.0 | 4.6 | 0.02 | ... | False | True | False | False | False | False | 1.373662e+08 | 1.486984 | 1.457316 | 0 |
5 rows × 40 columns
The preview above shows the feature set in the prepared data2 dataset.
Correlation matrix¶
The correlation matrix displays the correlation coefficients between variables, ranging from -1 to 1. A value of 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no correlation.
# Step 1: Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Step 2: Select only the numeric columns for correlation calculation
data_numeric = data2.select_dtypes(include=[np.number]) # Select only numeric columns
# Step 3: Calculate the correlation matrix
correlation_matrix = data_numeric.corr()
# Step 4: Plot the correlation heatmap
plt.figure(figsize=(12, 8)) # Adjust the size of the plot
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, vmin=-1, vmax=1)
# Step 5: Show the plot
plt.title('Correlation Heatmap for Data2')
plt.show()
Understanding the Heatmap¶
Strong Positive Correlations:¶
- Temperature and Wind Chill (0.99): Highly correlated, indicating that as temperature increases, wind chill also increases.
- Pressure and Pressure_bc (0.89): Strong correlation, showing that the Box-Cox transformation retains the original relationship.
- Visibility and Visibility_bc (0.94): High correlation, validating the transformation's effectiveness.
Moderate Correlations:¶
- Start_Lat and Start_Lng (-0.45): Moderate negative correlation, suggesting a geographic relationship.
- Temperature and Humidity (-0.31): Moderate negative correlation, indicating higher temperatures are associated with lower humidity.
Weak Correlations:¶
- Severity4 and Other Features: Weak correlations suggest that individual weather features may not strongly predict accident severity on their own.
Temporal Features:¶
- Start_Month, Month, Weekday, Day, Hour, Minute: Strong correlations among themselves but weak correlations with Severity4, indicating time-based patterns alone may not be strong predictors of accident severity.
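Given the near-perfect correlations noted above (for example Temperature vs. Wind_Chill, and the raw vs. Box-Cox-transformed columns), one option is to drop a member of each highly correlated pair before modeling. The sketch below shows a generic way to list such candidates; it is a suggestion only, and the models that follow are trained on the full data2 feature set.
# Sketch: list one column from each highly correlated pair (|r| > 0.95)
corr_abs = data_numeric.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("Candidate columns to drop:", redundant)
# data2_reduced = data2.drop(columns=redundant)  # uncomment to apply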
Machine Learning: Enhancing Predictive Analysis for Road Safety¶
Machine learning is crucial for our project, enabling us to predict accident severity and identify high-risk areas. By leveraging advanced algorithms, we can uncover complex patterns and relationships within our dataset, guiding targeted interventions and improving road safety.
Logistic Regression
Below is a logistic regression model to predict accident severity using a binary classification approach (Severity 0 vs Severity 1). We started by preprocessing the data, encoding categorical variables, sampling a subset of the dataset, and addressing class imbalances through under-sampling. We also tuned the hyperparameters of the logistic regression model using GridSearchCV to find the best parameters.
# Step 1: Import necessary libraries
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import classification_report
from collections import Counter
from sklearn.preprocessing import LabelEncoder
# Step 2.1: Drop the 'Severity' column if it is still present
# (the binary target 'Severity4' created earlier is used instead)
if 'Severity' in data2.columns:
    data2 = data2.drop(columns=["Severity"])
    print("'Severity' column dropped successfully.")
# Step 2: Identify categorical columns
categorical_columns = data2.select_dtypes(include=['object']).columns
# Apply LabelEncoder to categorical columns
label_encoder = LabelEncoder()
for col in categorical_columns:
data2[col] = label_encoder.fit_transform(data2[col].astype(str))
# Step 3: Sample 10% of the data to keep the grid search tractable
data_sampled = data2.sample(frac=0.1, random_state=42)
# Step 4: Split sampled dataset into X (features) and y (target)
X = data_sampled.drop('Severity4', axis=1) # 'Severity4' is your target variable
y = data_sampled['Severity4']
# Step 5: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
# Step 6: Under-sampling (Randomly undersample the majority class)
rus = RandomUnderSampler(sampling_strategy=0.1, random_state=42)
X_train_res, y_train_res = rus.fit_resample(X_train, y_train)
# Print the class distribution before and after resampling
print(f"Distribution of class labels before resampling: {Counter(y_train)}")
print(f"Distribution of class labels after resampling: {Counter(y_train_res)}")
# Step 7: Logistic Regression with class weights (balanced) and hyperparameter tuning via GridSearchCV
clf_base = LogisticRegression(solver='liblinear') # 'liblinear' is typically used for smaller datasets
grid = {
'C': 10.0 ** np.arange(-2, 3), # Regularization strength
'penalty': ['l1', 'l2'], # L1 or L2 regularization
'class_weight': ['balanced'] # Adjust the model to handle class imbalance
}
# Perform GridSearchCV to tune hyperparameters and cross-validation with reduced parallelization
clf_lr = GridSearchCV(clf_base, grid, cv=5, n_jobs=1, scoring='f1_macro') # f1_macro to handle imbalance
clf_lr.fit(X_train_res, y_train_res)
# Print best hyperparameters and classification report
print(f"Best Hyperparameters: {clf_lr.best_params_}")
print(f"Classification Report for Logistic Regression:\n {classification_report(y_test, clf_lr.predict(X_test))}")
# If needed, you can also access the model coefficients and intercept
coef = clf_lr.best_estimator_.coef_
intercept = clf_lr.best_estimator_.intercept_
print(f"Model Coefficients: {coef}")
print(f"Model Intercept: {intercept}")
Distribution of class labels before resampling: Counter({0: 74006, 1: 6998})
Distribution of class labels after resampling: Counter({0: 69980, 1: 6998})
Best Hyperparameters: {'C': np.float64(0.01), 'class_weight': 'balanced', 'penalty': 'l1'}
Classification Report for Logistic Regression:
precision recall f1-score support
0 0.96 0.74 0.84 31717
1 0.20 0.67 0.30 2999
accuracy 0.74 34716
macro avg 0.58 0.70 0.57 34716
weighted avg 0.89 0.74 0.79 34716
Model Coefficients: [[ 6.34132706e-02 2.16235155e-02 4.29774696e-01 1.38548730e-02
-1.29335622e-02 4.63320498e-04 -3.89300080e-04 1.97405923e-02
-1.74897543e-02 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 4.57231804e-01 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 2.51189793e-02
0.00000000e+00 -3.36056047e-01 1.80351966e-01 -1.80095538e-02
-1.71248115e-03 1.03354665e-01 -1.86324960e-03 3.99931330e-02
-3.78275961e-04 1.90460193e-01 -1.79189994e-02 -9.44800752e-02
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
-6.07153094e-09 0.00000000e+00 -8.79386804e-03]]
Model Intercept: [0.]
To handle class imbalance, we used RandomUnderSampler to reduce the number of majority-class instances. Initially, the training split had 74,006 non-severe accidents (class 0) and 6,998 severe accidents (class 1). With a sampling strategy of 0.1, the majority class was reduced to 69,980 instances against 6,998 severe accidents, a roughly 10:1 ratio. Reducing the imbalance, combined with balanced class weights, improves the model's ability to detect severe accidents.
Results:¶
Class Distribution: After under-sampling, the training set contained roughly 70,000 Severity 0 and 7,000 Severity 1 instances, a 10:1 ratio rather than a fully balanced split.
Best Hyperparameters: The optimal hyperparameters were C: 0.01, penalty: 'l1', and class_weight: 'balanced'.
Performance: The model achieved 74% accuracy, but with low precision (0.20) and moderate recall (0.67) for Severity 1, indicating room for improvement.
Conclusion: Logistic regression may not be the best model for this task, given its poor performance on the minority class. Alternative models like Random Forests or XGBoost could yield better results.
XGBoost¶
XGBoost is a powerful, efficient, and scalable gradient boosting algorithm commonly used for classification and regression tasks, and it is known for handling imbalanced datasets well. We turned to XGBoost after logistic regression struggled with the class imbalance, in particular its low precision on severe accidents.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
import matplotlib.pyplot as plt
data_sampled = data2.sample(frac=0.9, random_state=42)
# Split data into features (X) and target (y)
X = data_sampled.drop('Severity4', axis=1) # 'Severity4' is your target variable
y = data_sampled['Severity4']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
# Under-sample to handle class imbalance
rus = RandomUnderSampler(sampling_strategy=0.1, random_state=42)
X_train_res, y_train_res = rus.fit_resample(X_train, y_train)
# Print class distribution before and after resampling
print(f"Distribution of class labels before resampling: {Counter(y_train)}")
print(f"Distribution of class labels after resampling: {Counter(y_train_res)}")
# Initialize the gradient boosting model (XGBoost) with manually chosen hyperparameters
xgb_model = xgb.XGBClassifier(
objective='binary:logistic',
eval_metric='logloss',
use_label_encoder=False, # Deprecated in recent XGBoost releases; safe to drop there
learning_rate=0.1, # Chosen learning rate
n_estimators=100, # Number of trees
max_depth=5, # Maximum depth of trees
subsample=0.8, # Fraction of samples used per tree
colsample_bytree=0.8, # Fraction of features used per tree
random_state=42
)
# Fit the model
xgb_model.fit(X_train_res, y_train_res)
# Evaluate the model on the test set
y_pred = xgb_model.predict(X_test)
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
# Model Feature Importance
xgb.plot_importance(xgb_model)
plt.show()
Distribution of class labels before resampling: Counter({0: 666060, 1: 62976})
Distribution of class labels after resampling: Counter({0: 629760, 1: 62976})
Classification Report:
precision recall f1-score support
0 0.96 0.99 0.97 285403
1 0.79 0.58 0.67 27042
accuracy 0.95 312445
macro avg 0.88 0.78 0.82 312445
weighted avg 0.95 0.95 0.95 312445
Handling Class Imbalance:¶
Used RandomUnderSampler to shrink the majority class from 666,060 to 629,760 training instances, leaving roughly a 10:1 ratio and easing the imbalance before training.
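For completeness, an alternative to undersampling (not used in this project) is XGBoost's scale_pos_weight parameter, which keeps every training row and instead increases the loss contribution of the minority class. A hedged sketch, assuming the same X_train/y_train split from the cell above:
from collections import Counter
import xgboost as xgb

counts = Counter(y_train)            # roughly {0: 666060, 1: 62976} for this split
ratio = counts[0] / counts[1]        # about 10.6
xgb_weighted = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    scale_pos_weight=ratio,          # re-weight missed severe accidents instead of dropping rows
    learning_rate=0.1, n_estimators=100, max_depth=5,
    subsample=0.8, colsample_bytree=0.8, random_state=42
)
xgb_weighted.fit(X_train, y_train)   # trains on the full, un-resampled training set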
Results¶
Class 0 (majority class):
- Precision = 0.96
- Recall = 0.99
- F1-Score = 0.97
Class 1 (minority class):
- Precision = 0.79
- Recall = 0.58
- F1-Score = 0.67
Accuracy: 95% on the held-out test set (312,445 records)
Macro Average:
- Precision = 0.88
- Recall = 0.78
- F1-Score = 0.82
Weighted Average:
- Precision = 0.95
- Recall = 0.95
- F1-Score = 0.95
Conclusion¶
Our machine learning approach, including undersampling and XGBoost, provided valuable insights into predicting accident severity. The XGBoost model showed improved performance, particularly in handling class imbalance. Key factors influencing accident severity include geographic coordinates, distance, and weather conditions. Continuously refining our models will enhance road safety by accurately identifying and mitigating high-risk areas and conditions.
Neural Network (Multi-Layer Perceptron) using scikit-learn's MLPClassifier¶
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
# Oversample the minority class
oversampler = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train_res, y_train_res)
# Standardize features
scaler = StandardScaler()
X_train_res_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test)
# Define and train the neural network
mlp_model = MLPClassifier(
hidden_layer_sizes=(100, 50),
activation='relu',
solver='adam',
alpha=0.001,
max_iter=1000,
random_state=42
)
# Train the model
mlp_model.fit(X_train_res_scaled, y_train_resampled)
# Evaluate the model
y_pred = mlp_model.predict(X_test_scaled)
print(f"Classification Report for Neural Network:\n{classification_report(y_test, y_pred)}")
# Plot Loss Curve
plt.plot(mlp_model.loss_curve_)
plt.title('Neural Network Training Loss Curve')
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.show()
Classification Report for Neural Network:
precision recall f1-score support
0 0.98 0.94 0.96 63335
1 0.56 0.82 0.67 5955
accuracy 0.93 69290
macro avg 0.77 0.88 0.81 69290
weighted avg 0.95 0.93 0.94 69290
Results:¶
- Classification Report: The classification report provides metrics such as precision, recall, and F1-score for each class, indicating the model's performance.
- Loss Curve: The loss curve shows a steep initial decrease in loss, followed by a gradual decline and stabilization. This suggests that the model is learning effectively and converging well during training.
Conclusion:¶
- Model Convergence: The loss curve indicates successful model convergence, with the loss decreasing steadily over iterations.
- Model Performance: The classification report shows strong performance on non-severe accidents (F1 = 0.96) and noticeably higher recall on severe accidents (0.82) than the other models, at the cost of lower precision (0.56).
- Potential Improvements: If performance is not satisfactory, consider hyperparameter tuning, advanced oversampling techniques, feature engineering, and adjusting model complexity.
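As one concrete example of the "advanced oversampling techniques" mentioned above, SMOTE could replace RandomOverSampler in this pipeline. The following is a minimal sketch, not something we evaluated, and it assumes all features in X_train_res are numeric at this stage:
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler

# SMOTE synthesizes new minority-class samples by interpolating between neighbours,
# rather than duplicating existing rows as RandomOverSampler does.
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train_res, y_train_res)

scaler_sm = StandardScaler()
X_train_sm_scaled = scaler_sm.fit_transform(X_train_sm)
X_test_sm_scaled = scaler_sm.transform(X_test)
# The same MLPClassifier configuration defined above could then be refit with:
# mlp_model.fit(X_train_sm_scaled, y_train_sm)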
As data scientists, it is crucial to interpret our results to communicate their significance effectively. This involves making the numbers understandable and actionable for our team. Our original question was: What factors contribute to the severity of accidents, and how can we predict severe accidents to improve road safety?
Interpreting the Results¶
Our analysis provided valuable insights into the factors influencing accident severity. However, it's important to understand that sometimes our results may not provide a clear answer. Nonetheless, they always offer some guidance or insight into our question.
By examining the models we used, we can predict accident severity with varying degrees of accuracy. The XGBoost model, for instance, achieved an accuracy of 95%, with a precision of 0.96 for non-severe accidents and 0.79 for severe accidents. This indicates that while the model is highly accurate overall, it performs better in predicting non-severe accidents.
Model Performance¶
XGBoost Model:
- Precision: 0.96 for class 0, 0.79 for class 1.
- Recall: 0.99 for class 0, 0.58 for class 1.
- F1-Score: 0.97 for class 0, 0.67 for class 1.
- Accuracy: 0.95.
- Insight: The model shows a good balance but needs improvement in recall for severe accidents.
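One low-cost way to raise recall on severe accidents without retraining is to lower the decision threshold applied to the predicted probabilities, trading some precision for fewer missed severe accidents. This is only a sketch of the idea, assuming the fitted xgb_model and test split from the XGBoost cell are still in scope; the 0.35 cut-off is purely illustrative:
from sklearn.metrics import classification_report

# Default predictions use a 0.5 threshold; lowering it flags more accidents as severe.
proba_severe = xgb_model.predict_proba(X_test)[:, 1]    # P(Severity4 = 1)
y_pred_lower = (proba_severe >= 0.35).astype(int)       # illustrative threshold, not tuned
print(classification_report(y_test, y_pred_lower))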
Logistic Regression Model:
- Precision: 0.96 for class 0, 0.20 for class 1.
- Recall: 0.74 for class 0, 0.67 for class 1.
- F1-Score: 0.84 for class 0, 0.30 for class 1.
- Accuracy: 0.74.
- Insight: The model struggles with recall for severe accidents, indicating it may not be the best choice for this task.
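To make the connection between these metrics explicit: the F1-score is the harmonic mean of precision and recall, which is why the very low precision on severe accidents drags the logistic regression's F1 down despite its reasonable recall. A quick arithmetic check reproduces the class 1 figures from the reported values:
# F1 = 2 * precision * recall / (precision + recall)
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.79, 0.58), 2))   # XGBoost, class 1            -> 0.67
print(round(f1(0.20, 0.67), 2))   # Logistic regression, class 1 -> 0.31 (0.30 in the report, which uses unrounded values)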
Feature Importance¶
The feature importance analysis highlighted key factors influencing accident severity, such as geographic coordinates (Start_Lng, Start_Lat), distance, and weather conditions (Pressure(in), Temperature(F)). These insights are crucial for targeted interventions and resource allocation.
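For reference, the importances behind the plot can also be listed numerically rather than read off the chart; a minimal sketch assuming the fitted xgb_model and the feature frame X from the XGBoost cell:
import pandas as pd

# Pair each importance score with its column name and list the top ten features.
importances = pd.Series(xgb_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))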
Next Steps¶
While our models provided valuable insights, there is always room for improvement. We can consider the following steps:
- Data Collection: Gather more data or additional features that could improve model performance, such as detailed weather conditions, traffic volume, and road infrastructure.
- Model Tuning: Continue to tune hyperparameters and explore other models such as Random Forest or Support Vector Machines (a Random Forest sketch follows this list).
- Feature Engineering: Create new features or transform existing ones to capture more nuanced relationships in the data.
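As a starting point for the Random Forest suggestion above, a hedged baseline (not trained or evaluated in this project) could reuse the undersampled training split from the XGBoost cell; the settings below are illustrative rather than tuned:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# class_weight='balanced' plays the same role here as it did for logistic regression.
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=12,
    class_weight='balanced',
    n_jobs=-1,
    random_state=42
)
rf_model.fit(X_train_res, y_train_res)
print(classification_report(y_test, rf_model.predict(X_test)))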
Conclusion¶
Our machine learning approach provided valuable insights into predicting accident severity. The XGBoost model showed promising results, but there is still potential for improvement. By continuously refining our models and gathering more data, we can enhance road safety by accurately identifying and mitigating high-risk areas and conditions. This iterative process is a fundamental part of the data science lifecycle, and we remain committed to improving our predictions to make roads safer for everyone.
References¶
Data Science Lifecycle Overview
- Data Collection: Gathering raw data from various sources. Tools: Pandas.
- Data Cleaning: Preprocessing data by handling missing values, encoding categorical variables, and removing duplicates. Tools: Pandas, Numpy.
- Exploratory Data Analysis (EDA): Analyzing data distributions and relationships to extract meaningful insights. Tools: Seaborn, Matplotlib, Plotly.
- Feature Engineering: Creating new features or selecting important ones to improve model performance. Tools: Scikit-learn.
- Model Training & Evaluation: Building and testing models using various algorithms, then evaluating their performance. Tools: Scikit-learn, XGBoost.
- Dataset Source: US Accident Dataset
Machine Learning Algorithms Used:
- Logistic Regression (with balanced class weights and GridSearchCV tuning)
- XGBoost (gradient boosted trees)
- Neural Network (scikit-learn MLPClassifier)
Key Libraries and Tools Used:
- Pandas: Data manipulation and analysis library.
- Numpy: Core library for numerical computations.
- Scikit-Learn: A machine learning library offering tools for data analysis.
- Matplotlib: A library to create visualizations.
- Seaborn: A statistical data visualization library.
- Imbalanced-Learn: A library for handling imbalanced datasets.
- Folium: A library for creating interactive maps.
- XGBoost: A highly efficient gradient boosting library.
Helpful Resources (Data Science)
Helpful Resources (Road Safety and Accident Analysis)